Abstract

Identifying documents that contain timely and vital
information for an entity of interest, a task
known as vital filtering, has become increasingly
important with the availability of large document
collections. To efficiently filter such large text
corpora in a streaming manner, we need to compactly
represent previously observed entity contexts,
and quickly estimate whether a new document
contains novel information. Existing approaches
to modeling contexts, such as bag of
words, latent semantic indexing, and topic models,
are limited in several respects: they are unable
to handle streaming data, do not model the
underlying topic of each document, suffer from
lexical sparsity, and/or do not accurately estimate
temporal vitalness. In this paper, we introduce
a word embedding-based non-parametric representation
of entities that addresses the above limitations.
The word embeddings provide accurate
and compact summaries of observed entity contexts,
further described by topic clusters that are
estimated in a non-parametric manner. Additionally,
we associate a staleness measure with each
entity and topic cluster, dynamically estimating
their temporal relevance. This approach of using
word embeddings, non-parametric clustering, and
staleness provides an efficient yet appropriate representation
of entity contexts for the streaming
setting, enabling accurate vital filtering.


1.  Introduction 

To build up-to-date entity profiles from streaming text corpora,
we need to find references to entities of interest, and
study the information trends over time. Unfortunately, this is
an incredibly difficult task to perform efficiently on streams,
and thus, a large number of pertinent articles are seldom
retrieved by automated approaches. For example, Frank et al.
(2012) observe a considerable lag between the publication
date of articles and the date of their citations in Wikipedia.
The median time is over a year, and the distribution has a
long and heavy tail. This gap can be drastically reduced if
automated systems can accurately and efficiently suggest
relevant documents for entities of interest to editors as soon
as they are published.


The Knowledge Base Acceleration (KBA) track at TREC
addresses this task. Recent submissions to KBA (Liu et al.,
2013; Bouvier & Bellot, 2013; Efron et al., 2013; Zhang
et al., 2013; Bellogin et al., 2013) focus on solving the
aforementioned problems with supervised methods, using
mainly document, document-entity, and temporal level features.
They are, however, somewhat limited: they depend
heavily on labeled data, do not handle lexical sparsity in
contexts appropriately, and further, do not model the various
semantic topics in the references.


In this submission, we introduce a semi-supervised approach
suitable for streaming settings that uses word embedding
clusters and temporal relevance to represent entity contexts.
In particular, the word embeddings provide low-dimensional
yet accurate summaries of previously observed entity contexts,
and our algorithm updates the topic clusters, the number
of topics, and the entities' and topics' temporal relevance in an
online fashion, observing only a single document at a time.
We use this representation of unlabeled documents as features
in a supervised classifier to utilize labeled data. This
combination of word embeddings, non-parametric clustering,
and temporal relevance (staleness) provides an efficient
yet accurate representation of entity contexts that can be
updated in a streaming manner, thus addressing the document
filtering requirements on large streams of text. We present
experimental results that demonstrate the benefits of our
method and show our performance on the TREC KBA 2014
Vital Filtering task. As part of the Accelerate and Create
task, we also describe an exploratory tool for efficient and
intuitive visualization of large streams.

2. Vital Filtering Task

In this section, we formalize the problem setup and introduce
our notation. We assume a set of m target entities
$E = \{e_1, \ldots, e_m\}$. We further assume a set of n documents
$D = \{d_1, \ldots, d_n\}$ that arrive in chronological order.

Each document is a sequence of sentences composed of
words, annotated with NLP tools. Further, we
assume w.l.o.g. that every document in D refers to a single
entity $e \in E$. Since our focus here is to distinguish vital
and non-vital references, we use a naive classifier-based
resolution to identify the documents relevant (or referent) to
each entity (details in Section 6.4), although in practice this
is a challenging task and more sophisticated techniques are
required (Rao et al., 2010; Singh et al., 2011).

A mention of e in a document $d_i \in D$ is identified by a
string matching algorithm that searches for exact matches
of canonical and surface form names of the entity e. We
represent each $d_i$ as a compound of a timestamp $t_i$ and a
bag of words $W_i = \{w_{i1}, \ldots, w_{ip}\}$ located in the context
of (and including) mentions of the entity e. Finally, we
assume an online setting, i.e. the algorithm should provide
predictions for documents arriving at time t before seeing
any documents arriving at time t + 1.

Given this setup, the vital filtering task requires classification
of each document d relevant to an entity e as follows:

• Vital if the document contains information that, at the
time it enters the stream, would cause an update to the
entity e with timely, new information about the entity's
current state, actions or situation, e.g. “Barack Obama
has been elected President”.

• Non-Vital if the document is relevant, but contains
information that is not timely, i.e. it may contain
information relevant when building an initial profile of
the entity e, but does not contain information that an
accurate, updated profile would not have, e.g. “Barack
Obama was born on August 4th, 1961”.


3. Proposed Approach

Given a stream of documents D that refer to entity e, the
task at hand is to predict whether each document is vital
or non-vital to e. To detect whether a document contains
novel information, one needs to provide an accurate and
generalizable representation of historical contexts, while
also capturing the temporal dynamics of the references.

To this end, we propose a three-pronged solution: (1) represent
documents with low-dimensional embeddings that
address sparsity and generalization (Section 3.1), (2) represent
the entity context using non-parametric topic clustering
(Section 3.2), and (3) estimate the novelty of the document
information using a staleness measure (Section 3.3).


3.1. Document Embeddings

To identify whether an entity context in a document contains
novel information, or even if it is relevant for the entity, we
need a structured representation of the context. A common
solution to this problem is to use vector space models, often
the Bag of Words (BOW) models, where a document is
represented as the bag of its words, disregarding grammar
and even word order. Unfortunately, vector space models
are often too sparse to represent fine-grained information in
contexts. For example, straightforward BOW representations
will have minimal overlap between “Barack was elected
president today” and “Obama has won the election”, treating
the other as novel information even after having seen one
of them. Further, the size of BOW representations grows
over time when the vocabulary is not defined beforehand,
which is a problem in streaming settings.

In order to address these concerns, we propose to represent
contexts of entities in documents using word embeddings.
A word embedding is a dense, low-dimensional, and real-valued
vector associated with every word in a vocabulary
such that it captures useful syntactic and semantic properties
of the contexts that the word appears in. The low
dimensionality of the embeddings as compared to vector
space models (hundreds instead of millions) makes them an
elegant solution to address lexical sparsity in settings with
very few labels (Turian et al., 2010), and further, they can be
efficiently trained on massive corpora. Many of the syntactic
patterns can be represented with simple algebraic operations.
For example, the result of $v_{paris} - v_{france} + v_{germany}$ is
closer to $v_{berlin}$ than to any other word vector (Mikolov
et al., 2013a;b).

We introduce a function $f : w \mapsto v_w \in \mathbb{R}^d$ that defines the
pre-computed word embedding representation of the word
type w.* To define the embedding for a set of words W, we
define $g : W \mapsto v_W \in \mathbb{R}^d$ that computes the embedding as:

$$ g(W) = v_W = \frac{1}{|W|} \sum_{w \in W} f(w) \tag{1} $$


Given the document $d_i \in D$ that refers to entity e and
contains the words $w_{ij} \in W_i$, we compute its vector representation
using function g as follows:

$$ v_{d_i} = v_{W_i} = g(W_i) \tag{2} $$

With this, we intend to capture the context where the entity
e is mentioned in a document, i.e. the topic, and represent it
with a dense, low-dimensional vector.

*We use the 300-dimensional word embeddings trained on the
Google News corpus, available at https://code.google.com/p/word2vec/

Further, it may be useful to separately capture the context in
terms of different parts of speech. Let $W_{in}$ denote the set of
all common nouns in $W_i$, $W_{iN}$ the set of the proper nouns,
and $W_{iv}$ the set of all verbs in $W_i$, where
$W_{in} \cup W_{iN} \cup W_{iv} = W_i$. We compute the embedding vectors of all the
common nouns, proper nouns, and verbs that appear in the
context of entity e using function g, as:

$$ v_{d_{in}} = v_{W_{in}} = g(W_{in}) \tag{3} $$

$$ v_{d_{iN}} = v_{W_{iN}} = g(W_{iN}) \tag{4} $$

$$ v_{d_{iv}} = v_{W_{iv}} = g(W_{iv}) \tag{5} $$

Computing  separate  embeddings  for  different  word  types  is 
a  flexibility  our  method  provides  that  may  better  encapsulate 
the  underlying  content  of  the  document. 
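As an illustration of the function g and its part-of-speech variants, the following is a minimal sketch, not the authors' implementation. It assumes a toy two-dimensional embedding table (TOY) in place of the pre-trained vectors, hypothetical tag labels "n", "N", and "v" for common nouns, proper nouns, and verbs, and it skips words that have no embedding, returning a zero vector when nothing is left to average.

```python
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


def mean_embedding(words: Iterable[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Average the embeddings of the given words (the function g).

    Assumption: words missing from the table are skipped, and a zero
    vector is returned when no word has an embedding.
    """
    total = [0.0] * dim
    count = 0
    for word in words:
        vec = table.get(word)
        if vec is None:
            continue
        for k in range(dim):
            total[k] += vec[k]
        count += 1
    if count == 0:
        return total
    return [x / count for x in total]


def pos_embeddings(tagged: Iterable[Tuple[str, str]], table: Dict[str, Vector],
                   dim: int) -> Dict[str, Vector]:
    """Compute one mean embedding per part-of-speech group.

    The tag labels "n", "N", "v" are hypothetical names for common nouns,
    proper nouns, and verbs; words with any other tag are ignored.
    """
    groups: Dict[str, List[str]] = {"n": [], "N": [], "v": []}
    for word, tag in tagged:
        if tag in groups:
            groups[tag].append(word)
    return {tag: mean_embedding(ws, table, dim) for tag, ws in groups.items()}


# Toy embedding table, for illustration only.
TOY = {"obama": [1.0, 0.0], "won": [0.0, 1.0], "election": [1.0, 1.0]}
doc = [("obama", "N"), ("won", "v"), ("election", "n"), ("the", "x")]

v_d = mean_embedding([w for w, _ in doc], TOY, dim=2)
v_pos = pos_embeddings(doc, TOY, dim=2)
```

The zero-vector fallback is a design choice of this sketch; it pairs naturally with an indicator feature that flags an empty embedding, so that a classifier can tell "no words" apart from a genuine vector near the origin.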


3.2.  Non-parametric  Clustering 

Although word embeddings capture the context around a
single topic quite accurately, they are unable to represent
the variety of topics that an entity may be mentioned in.
For example, the context around Obama during elections
is quite different from the context for a presidential speech
or an international visit. Using a single word embedding to
represent multiple such topics may result in embeddings
that conflate them, i.e. a single embedding is inaccurate for
representing multiple topics.

One typical approach to tackle this problem is using topic
models (Blei, 2012). Such models can be trained in an offline
manner over a large corpus, followed by streaming
inference for each document. However, the number of topics
often needs to be decided a priori, which is quite difficult
to specify for each entity of interest (non-parametric approaches
to LDA are quite expensive). Further, drift over
time can make the topic distributions obsolete. Finally, it is
difficult to learn per-entity topic distributions, especially if
some of the entities have very few relevant documents.

Instead of representing the context using only a single embedding,
we propose to use a number of embeddings that
capture the different topic clusters of the entity, retaining
the advantages of using embeddings while still having a
precise context representation. We assume that the context
in a single document belongs to a single topic, though
we dynamically estimate the number of topic clusters in a
non-parametric manner. As we are concerned with a streaming
setting, topic clusters evolve over time, i.e. identities,
members and number of clusters change over time.

We represent each topic cluster by the mean embedding
vector of the documents assigned to that cluster at a certain
timestamp. More precisely, the vector representation of the
j-th topic cluster at timestamp $t_i$, $v_{c_j^i}$, can be computed using:

$$ v_{c_j^i} = \frac{1}{|D_j^i|} \sum_{d \in D_j^i} v_d \tag{6} $$

where $D_j^i$ is the subset of all the documents that belong to
cluster j at timestamp $t_i$, and $\forall d_q \in D_j^i$, $t_q \le t_i$.

Figure 1: Example of Non-parametric Clustering

The number of topic clusters for the context of entity e is
unknown beforehand. Initially, we let the entity context
have zero topic clusters. We create the first topic cluster
for the entity context when the first relevant document is
observed. For any following relevant document d, the topic
clusters are updated as follows. We first compute the distance
from $v_d$ to every existing topic cluster. If the minimum
distance to any topic cluster is greater than or equal to $\alpha$
($0 < \alpha < 1$), we create a new topic cluster just containing
document d; otherwise we merge document d into the closest
cluster to $v_d$, and update the cluster vector representation.
Our approach is closely related to the online non-parametric
clustering procedure described in Neelakantan et al. (2014).

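More formally, $\forall c_j^{i-1}$, at time i, document $d_i$ is added to the
topic cluster that solves the following optimization problem:

$$ \operatorname*{argmin}_{j} \; \mathrm{dist}(v_{d_i}, v_{c_j^{i-1}}) \quad \text{subject to} \quad \mathrm{dist}(v_{d_i}, v_{c_j^{i-1}}) < \alpha \tag{7} $$

where $\mathrm{dist}(\cdot, \cdot)$ is the cosine distance defined as:

$$ \mathrm{dist}(x, y) = 1 - \cos(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|} \tag{8} $$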


The j-th topic cluster at time i is updated, and is therefore
composed of the subset of documents $D_j^i \subseteq D$, where
$D_j^i = D_j^{i-1} \cup \{d_i\}$. Note that the cluster center is updated
in constant time by incrementally maintaining the sum of
the member embeddings.

Figure 1 illustrates an example of such clustering, using two
dimensions to represent the vectors. Let's assume document
$d_1$ appears in the stream first, and mentions the days Barack
Obama was a senator. As it is the first document referring to
the entity Obama, we add a new topic cluster senator with
vector $v_{d_1}$. Then, document $d_2$ appears in the stream, and
refers to Obama as being elected President of the United
States. The distance to the previous cluster senator is
greater than $\alpha$ due to the semantic difference in the words,
therefore the algorithm proceeds to create a new topic cluster
president centered at $v_{d_2}$. Finally, $d_3$ enters the stream. It
talks about Obama as the current President of the U.S. The
algorithm compares its distance to the previous clusters and
finds that it is closest to the president cluster. The distance
is less than $\alpha$, hence it adds $d_3$ to the president cluster and
updates the cluster center.
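The update rule of this section can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the class name TopicClusters, the threshold value 0.3, and the two-dimensional toy vectors are hypothetical, and a zero vector is treated as being at distance 1 from everything. Each cluster keeps a running sum and a count, so its center is refreshed in constant time per cluster.

```python
import math
from typing import List

Vector = List[float]


def cosine_distance(x: Vector, y: Vector) -> float:
    """Cosine distance 1 - cos(x, y); zero vectors get distance 1 (assumption)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 1.0
    return 1.0 - dot / (norm_x * norm_y)


class TopicClusters:
    """Online clustering with a distance threshold alpha (hypothetical helper)."""

    def __init__(self, alpha: float) -> None:
        self.alpha = alpha
        self.sums: List[Vector] = []   # running sum of member embeddings
        self.counts: List[int] = []    # number of members per cluster

    def center(self, j: int) -> Vector:
        """Mean embedding of cluster j, derived from the running sum."""
        return [s / self.counts[j] for s in self.sums[j]]

    def add(self, v_d: Vector) -> int:
        """Assign a document vector to a cluster and return the cluster index."""
        best_j, best_dist = -1, float("inf")
        for j in range(len(self.sums)):
            dist = cosine_distance(v_d, self.center(j))
            if dist < best_dist:
                best_j, best_dist = j, dist
        # No cluster yet, or the closest one is too far: open a new cluster.
        if best_j < 0 or best_dist >= self.alpha:
            self.sums.append(list(v_d))
            self.counts.append(1)
            return len(self.sums) - 1
        # Otherwise merge into the closest cluster by updating its sum.
        self.sums[best_j] = [s + x for s, x in zip(self.sums[best_j], v_d)]
        self.counts[best_j] += 1
        return best_j


# Toy replay of the three-document example: senator, president, president.
clusters = TopicClusters(alpha=0.3)
assignments = [clusters.add(v) for v in ([1.0, 0.0], [0.0, 1.0], [0.1, 0.9])]
```

In this toy run the first two vectors are orthogonal, so the second opens a new cluster, while the third lies within the threshold of the second cluster and is merged into it.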

3.3.  Staleness 

We have been concerned with detecting whether a document
d contains a novel context in terms of the documents seen
so far. By representing the context of an entity as a set of
topic clusters, each with an embedding vector, we are able
to accurately summarize the entity context information. We
expect that documents that are not close to existing clusters
contain novel information. Unfortunately, this representation
ignores the timeliness of the information, and it is quite
possible that a document that is similar to existing clusters
contains novel information. For example, when a document
describes Obama's victory in an election, it may be assigned
to an existing cluster describing a previous election he won,
even though it actually contains new information.

A potential solution is to keep track of when the last document
was assigned to a cluster. However, the KBA challenge
requires all documents that contain novel information within
a time frame to be marked vital as per the timeliness of the
document. Such timeliness is a subjective interpretation that
can vary per entity and event. As an example, let's assume
that several documents talk about an event that happened
to entity e. During a “short” time frame (here is where the
subjective interpretation comes in) that information can be
considered new. After a while, that new information transitions
to a background state, and so the documents transition
from being vital to non-vital.

In order to address such temporal dynamics that capture
novelty and transition documents from a vital to a non-vital
state, we propose a dynamic staleness measure $\lambda_t$, $0 < \lambda_t < 1$.
This staleness measure can be used both for entities and
topic clusters. Low staleness of the assigned entity/cluster
represents vital documents, while high staleness is intended to
represent non-vital ones.

The staleness of an entity/cluster at any time t depends on
the staleness and the time of the last document $d_j$ assigned
to the entity/cluster. The staleness decay rate is exponential,
and is controlled by the hyperparameter $\gamma_{dec}$:

$$ \lambda_t = \lambda_j \exp\left(-\gamma_{dec} \frac{t - t_j}{T}\right) \tag{9} $$

where $\gamma_{dec} > 0$, $t_j$ and $\lambda_j$ are the timestamp and staleness
of the last document assigned to the entity/cluster, and T is
a constant (used to transform the units of time).

Figure 2: Staleness of Unpopular Entity

Figure 3: Staleness of Entity with Fluctuating Popularity

When a new document $d_i$ is assigned to an entity/cluster
at time $t_i$, we can estimate the staleness of the entity/cluster
at that time using the above equation,
$\lambda_{t_i} = \lambda_{i-1} \exp\left(-\gamma_{dec} \frac{t_i - t_{i-1}}{T}\right)$.
This staleness can be used to estimate
the novelty of the information in $d_i$, i.e. a low $\lambda_{t_i}$
suggests the document contains information that has not
been observed for a while.

Thereafter, since we have just observed a relevant document
for the entity/cluster, we need to increase its staleness. We
use a simple interpolation to increase it:

$$ \lambda_i = 1 - \gamma_{inc} (1 - \lambda_{t_i}) \tag{10} $$

where $0 < \gamma_{inc} < 1$. The staleness for the entity/cluster
is now $\lambda_i$, which is used when the next document $d_{i+1}$ is
observed.
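The decay and increase steps can be combined into a small tracker, sketched below as an illustration only. The class name Staleness, the parameter values, and in particular the initial staleness of 0.1 are assumptions of this sketch; the text does not state how staleness is initialized before the first document is seen.

```python
import math
from typing import Optional


class Staleness:
    """Tracks the staleness of one entity or topic cluster (hypothetical helper)."""

    def __init__(self, gamma_dec: float, gamma_inc: float, time_unit: float,
                 initial: float = 0.1) -> None:
        self.gamma_dec = gamma_dec    # decay rate, must be > 0
        self.gamma_inc = gamma_inc    # interpolation weight, in (0, 1)
        self.time_unit = time_unit    # the constant T that rescales time
        self.value = initial          # assumed starting staleness
        self.last_time: Optional[float] = None

    def at(self, t: float) -> float:
        """Exponentially decayed staleness at time t (the decay equation)."""
        if self.last_time is None:
            return self.value
        elapsed = (t - self.last_time) / self.time_unit
        return self.value * math.exp(-self.gamma_dec * elapsed)

    def observe(self, t: float) -> float:
        """Record a document at time t.

        Returns the decayed staleness seen by that document, then raises
        the stored staleness by interpolating towards 1.
        """
        decayed = self.at(t)
        self.value = 1.0 - self.gamma_inc * (1.0 - decayed)
        self.last_time = t
        return decayed


# A burst of three documents, followed by one after a long silence.
tracker = Staleness(gamma_dec=1.0, gamma_inc=0.5, time_unit=10.0)
burst = [tracker.observe(t) for t in (0.0, 1.0, 2.0)]
after_gap = tracker.observe(102.0)
```

Within the burst the staleness seen by each new document rises, which is the signal for non-vital repetition, while the document arriving after the long gap sees a staleness close to zero.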

Figure 2 illustrates an example of an entity with a decreasing
staleness. There are almost no documents referring to
the entity. As soon as some activity is detected, i.e. a document
mentioning the entity appears (t = 10), the staleness
increases slightly. Given the fact that there is not much
information about the entity, every new document would drive
an update to the entity profile, strongly suggesting vitalness.

Figure 3 represents the staleness of an entity with fluctuating
activity levels in the stream of documents. An important
event involving the entity starts at time t = 10 and continues
for a substantial period, showing a growing trend in popularity.
At the beginning, those documents can be considered
vital, but as documents continue commenting on the same
event over time, the information starts getting stale, clearly
indicating non-vitalness. Near t = 40, when the event is over,
a steep decrease in popularity is observed. At a later time,
t = 50, a new event occurs, strongly suggesting vitalness.


4.  Visualization  for  Accelerate  and  Create 


Intuitive  and  effective  visualization  techniques  can  provide 
valuable  tools  in  assisting  editors  to  populate  entity  profiles 
and  to  perform  exploratory  analysis  of  large  collections  of 
documents.  In  this  section,  we  describe  the  requirements 
of  such  visualization  tools  for  streaming  documents.  Then, 
we  present  our  visualization  prototype  for  the  Accelerate 
and  Create  task  that  enables  users  to  enlarge  parts  of  the 
visual  space  while  simultaneously  shrinking  the  context,  a 
technique called focus-plus-context (Silic & Basic, 2010).


4.1.  Goals  and  Challenges 

Visual exploration of text streams is a challenging task.
As text streams continuously evolve, visualization methods
should allow tracing the temporal evolution of existing
topics, detection of new ones, and examination of the
relationships between them. Such systems should also allow
users to interactively change the information they are seeking
at any time. Interactivity is therefore a crucial factor in
a domain where users do not know the text documents in
advance (Alsakran et al., 2012).

In this work we intend to provide an easy-to-use visualization
that enables users to debug what is going on in the system.
We provide different mechanisms to select data based on
users' interests; in particular we focus our attention on
providing interactive time-series widgets.


4.2.  Our  Implementation 

We propose a browser-based visualization prototype that
enables users to switch between multiple entities of interest,
select the time ranges to explore, examine the prominence
of topics over time, and understand the topics using
lists of similar words. The visualization tool starts with
the user selecting an entity of interest using an
autocomplete-enabled text box. For the selected entity, our visualization
consists of two views: Document and Topic.


Figure 4: Staleness and Topic cluster evolution. Panels: (a) Predictions Distribution chart example; (c) Topic cluster example


The Document view shows the distribution of vital and
non-vital documents over the complete timeline, summarizing
and differentiating the time frames when the documents
contain a non-vital reference to the entity, and when they
contain vital information. Figure 4a illustrates the distribution
of predictions for entity Kshama Sawant in a specific
period of time. Once the interesting time frames have been
identified, this view also allows the user to navigate to and
read individual documents.

The Topic view shows the evolution of the topic clusters
for the entity, illustrating the predicted proportion of topic
clusters over time. This view primarily plots the staleness
of a cluster over time, indicating when a cluster started,
when it was mentioned in the documents, and when it fell
into obsolescence. The
user can also study the topics in finer detail; clicking on any
point in the timeline brings up a word cloud representation
of the topic at that time. Figure 4b, for example, shows the
staleness evolution for the different topic clusters of entity
Mike Kluse. Figure 4c is the result of a user click on the
point highlighted in Figure 4b. It shows the closest words
to the topic cluster C1 at that time.

Both views consist of a timeline over the whole stream,
allowing users to quickly navigate over different time
frames. The timeline is an active (zoomed) section that
can be changed using the time range filters located below.
Legends also act as filters: users have the option of observing
specific clusters or predictions by selecting them in the
legend. These interactive time-series controls combined
with the word cloud representations allow users to explore
streaming data and filter information based on their needs.


5.  Related  Work 


Several knowledge base acceleration competitions have
been held in the recent past, testifying to the great progress
achieved in this field (Gross et al., 2012). Liu & Fang
(2012) present one of the best performing systems in TREC
KBA 2012. They created broader representations of entity
profiles based on a Wikipedia snapshot and considered
the anchor text of all internal Wikipedia links as related
entities. In the TREC KBA 2013 competition, different families
of methods were proposed, including query expansion,
classification, and learning to rank.


Our strategy is somewhat similar to Wang et al. (2013) in the
sense  that  we  first  target  a  high  recall  system  and  then  apply 
different  classification  methods  to  differentiate  between  vital 
and  non-vital  documents.  One  key  difference  is  that  we  do 
not  exploit  any  external  resources  to  construct  features,  e.g. 
we  do  not  use  Wikipedia  entity  pages  nor  existing  citations 
in  the  Wikipedia  page  of  an  entity. 

Representing words as continuous vectors has been studied
for a number of years (Hinton et al., 1987; Elman, 1990).
The progress in machine learning techniques in recent years
has enabled training more complex models on much larger
data sets (Mikolov et al., 2013a). One popular approach
to improving accuracy by exploiting large datasets is to
use unsupervised methods to create word features, or to
download word features that have already been produced
(Turian et al., 2010). In our method, we do the latter, using
already induced word embedding features in order to improve
our system accuracy. To the best of our knowledge, no
other technique has proposed the use of word embedding
representations for the vital filtering task.


One of the pioneering works on detecting novel documents
was introduced by Zhang et al. (2002). They explicitly
model relevance and redundancy as separate concepts. They
propose different redundancy measures and empirically
show that the cosine similarity metric is effective in identifying
redundant documents; one limitation is that they
keep only the 10 most recent documents for a profile. In our
method, we summarize the complete history of documents
for a given entity, which allows a more accurate estimate of
the query document redundancy. Gamon (2006) addresses
the problem of staleness detection by building an association
graph that connects sentences and sentence fragments, and
uses graph-based features as indicators of lack of novelty.
Though the task is somewhat similar, it is more limited in
the sense that they do not need to model the transition from
new to background information.


Many of the recent approaches have focused on scaling
novelty detection algorithms, also known as First Story Detection,
in the streaming setting, by either using LSH (Petrovic
et al., 2010) or just employing simple heuristics (Luo et al.,
2007). While their work mainly focuses on efficiency at
the cost of accuracy, our work aims to achieve accurate
representations without compromising efficiency.


Streaming document filtering is also related to several other
fields, including but not limited to, entity linking (Ji &
Grishman, 2011), text categorization (Kjersten & McNamee,
2012), news surveillance (Steinberger, 2014), and cross-document
coreference (Rao et al., 2010; Singh et al., 2011).


6.  KBA  Vital  Filtering  Evaluation 

In  this  section  we  describe  the  TREC  KBA  Vital  filtering 
task,  and  our  evaluation  setup. 

6.1.  Data 

To assess our contributions we use the TREC KBA 2014 filtered
stream corpus. The filtered corpus contains around 20M
documents annotated with BBN's Serif NLP tools, including
within-doc coreference and dependency parse trees. Further,
we use the 71 target entities given by the KBA organizers for
the Vital Filtering task. Among the 20M documents, around
28K have truth labels. Of these labeled examples, 8K are
treated as training instances, and the rest as test instances.
We preprocess the corpus to retain only the documents that
contain exact string matches to the target entities' names,
including canonical and surface form names.
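The exact-match preprocessing step could look like the following minimal sketch. The function name filter_by_exact_match, the (id, text) document shape, and the case-sensitive substring test are assumptions made for illustration; the corpus format and the matching details of the actual system are not specified here.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Document = Tuple[str, str]  # (document id, raw text), an assumed shape


def filter_by_exact_match(docs: Iterable[Document],
                          names: Sequence[str]) -> Iterator[Document]:
    """Yield only documents whose text contains one of the entity names.

    Matching is a case-sensitive substring test (assumption); names should
    hold the canonical name plus any surface forms of the entity.
    """
    for doc_id, text in docs:
        if any(name in text for name in names):
            yield doc_id, text


# Toy stream, processed one document at a time.
stream = [
    ("d1", "Barack Obama visited Berlin."),
    ("d2", "Nothing relevant here."),
    ("d3", "President Obama spoke."),
]
kept = [doc_id for doc_id, _ in filter_by_exact_match(stream, ["Barack Obama", "Obama"])]
```

Because the function is a generator, it consumes the stream lazily and never needs to hold the full corpus in memory.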


6.2.  Features 


Our approach extends the classifier introduced by Wang
et al. (2013). We construct a basic set of features based on
the document and the entity of interest. Using our representation,
we include additional features for the embedding,
clustering, and staleness. A summary of the features we use
is presented in Table 1.
Feature               Description

Basic Features, F_b

  Based on document d
  log(len(d))         log of the length of d
  source(d)           discretized source of d

  Based on document d and target entity e
  n(d, e)             # of occurrences of target entity e in d
  n(d, e_p)           # of occurrences of partial name of e in d
  fpos(d, e)          position of first occurrence of entity e in d
  fpos_n(d, e)        fpos(d, e) normalized by document length
  fpos(d, e_p)        position of first partial occurrence of e in d
  fpos_n(d, e_p)      fpos(d, e_p) normalized by document length
  lpos(d, e)          position of last occurrence of entity e in d
  lpos_n(d, e)        lpos(d, e) normalized by document length
  lpos(d, e_p)        position of last partial occurrence of entity e in d
  lpos_n(d, e_p)      lpos(d, e_p) normalized by document length
  spread(d, e)        lpos(d, e) - fpos(d, e)
  spread_n(d, e)      spread(d, e) normalized by document length
  spread(d, e_p)      lpos(d, e_p) - fpos(d, e_p)
  spread_n(d, e_p)    spread(d, e_p) normalized by document length

Embedding Features, F_e

  Based on a single, combined embedding, F_e^c
  v_d                 mean word embedding representation of d
  zero(v_d)           1[v_d = 0], set to 1 if v_d is 0

  Based on POS embeddings, F_e^p
  v_dn                mean word embedding for common nouns
  zero(v_dn)          1[v_dn = 0], set to 1 if v_dn is 0
  v_dN                mean word embedding for proper nouns
  zero(v_dN)          1[v_dN = 0], set to 1 if v_dN is 0
  v_dv                mean word embedding for verbs
  zero(v_dv)          1[v_dv = 0], set to 1 if v_dv is 0

Clustering Features, F_c
  min_c(v_d, v_c)     minimum distance of v_d to topic clusters of e
  avg_c(v_d, v_c)     average distance of v_d to topic clusters of e

Temporal Features, F_t
  A(e)                current staleness of entity e
  A(e, c)             current staleness of topic c of target entity e

Table 1: Features for Vital Filtering classification
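
To make the entity-based basic features concrete, the following sketch computes the count, first and last position, and spread for one name in a tokenized document. It is illustrative only: the function name, the use of token offsets as positions, and the dictionary keys are assumptions, since the paper does not state the unit in which positions are measured.

```python
import math


def positional_features(tokens, name_tokens):
    """Occurrence count, first/last position, and spread of a name in a document.

    tokens and name_tokens are lists of strings; positions are token offsets.
    Position features are omitted when the name does not occur.
    """
    n, k = len(tokens), len(name_tokens)
    hits = [i for i in range(n - k + 1) if k and tokens[i:i + k] == name_tokens]
    feats = {"log_len": math.log(n) if n else 0.0, "count": len(hits)}
    if hits:
        fpos, lpos = hits[0], hits[-1]
        spread = lpos - fpos
        feats.update(
            fpos=fpos, lpos=lpos, spread=spread,
            fpos_n=fpos / n, lpos_n=lpos / n, spread_n=spread / n,
        )
    return feats
```

Calling the same function with a partial name (for example `["Obama"]` instead of `["Barack", "Obama"]`) gives the e_p variants of each feature.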


6.3.  Relevance  Classification 

The TREC KBA 2014 corpus contains documents that do not
refer to the target entities, even though they may contain
mentions of them. We therefore need a non-referent
category of documents. A document is non-referent if
it does not refer to a target entity, or if the context is so
ambiguous that it is impossible to decide whether the mention
refers to the entity or not. An example of the former case
is “Barack Ferrazzano provides a wide range of
business-oriented legal”, which clearly does not refer to Barack Obama.
For the latter, an example is “Barack is a great father and
a better husband”. The mention “Barack” may refer to any
married parent named Barack; therefore, we consider it
non-referent. The vital and non-vital classes described in Section
2 fall into a referent (or relevant) category, which contains
documents that refer to the target entities.


Figure 5: Classification process


Because not all documents in the corpus refer
to the target entities, we include an extra step in our
classification process, as shown in Figure 5. We introduce an
additional classifier, called rnr, which is trained offline
and classifies documents as referent or non-referent.
Consequently, in every experiment, each document first goes
through the rnr classifier. Only the documents that rnr labels
as referent are used as inputs to the vnv classifier, which
discriminates between vital and non-vital documents, the
overall focus of this work.


We use randomized tree ensemble classifiers (Geurts et al.,
2006) for both rnr and vnv, each composed of 100 weak
learners. The maximum depth of each tree in the ensembles
is 150. All our experiments use the same rnr model, trained
with the basic features listed in Section 6.2. The
methods differ in the features used in the vnv classifier.
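
The two-stage process can be sketched as a simple cascade. This is a minimal sketch under stated assumptions: `classify` is a hypothetical helper, and `rnr` and `vnv` are arbitrary callables standing in for the two trained tree ensembles.

```python
def classify(doc_features, rnr, vnv):
    """Two-stage cascade: referent check first, then vital vs. non-vital.

    rnr and vnv are any callables mapping a feature vector to a label;
    they stand in for the two trained tree ensembles.
    """
    if rnr(doc_features) != "referent":
        return "non-referent"
    return vnv(doc_features)  # "vital" or "non-vital"
```

One possible way to instantiate each stage would be scikit-learn's `ExtraTreesClassifier(n_estimators=100, max_depth=150)`; that library choice is an assumption, as the paper does not name an implementation.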


6.4.  Methods 


We evaluate the following approaches in our experiments.
To compare against existing baselines, we use just the
basic features, F_b (Baseline); Wang et al. (2013) and Bellogín et al.
(2013) propose a similar technique, although they train their
models with a few more features. We also include an
additional baseline that uses multi-task learning (Caruana, 1993)
to learn separate parameters for each entity, called Baseline,
Multi-task. In order to evaluate the effect of adding word
embeddings, we introduce two extensions to the baselines
that use the embedding features: Embedding, Single, which
uses a single embedding for every document (F_e^c features),
and Embedding, POS, which maintains different embeddings
for common nouns, proper nouns and verbs (F_e^p features);
see Section 3.1 for details. We separately evaluate the
utility of temporal modeling via staleness by introducing the
Staleness only method, which includes the F_t features.
Similarly, the Clustering only method uses the clustering
features (F_c) but not the temporal ones.
Finally, the approach that combines all contributions,
Combined, includes all the F_b, F_e^p, F_t, and F_c features.
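
The embedding and clustering features used by these methods can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, embeddings are a plain word-to-vector dictionary, and Euclidean distance is assumed because the distance measure is not specified in this section.

```python
import math


def mean_embedding(words, embeddings, dim):
    """Average the vectors of in-vocabulary words; zero vector if none."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def is_zero(vec):
    """Indicator feature: 1 if the vector is all zeros, else 0."""
    return int(all(x == 0.0 for x in vec))


def cluster_distances(v_d, centroids):
    """Minimum and average Euclidean distance from v_d to cluster centroids.

    Assumes at least one centroid exists for the entity.
    """
    dists = [math.dist(v_d, c) for c in centroids]
    return min(dists), sum(dists) / len(dists)
```

For the POS variant, `mean_embedding` would be applied separately to the common nouns, proper nouns, and verbs of a document, giving three vectors and three zero indicators.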


7.  Results  and  Discussion 

Table 2 shows the submitted and revised precision, recall
and F1 results of the methods explained in Section 6.4, computed
using the KBA scorer tool with the 2014-07-11 truth data. The
models that include the F_c features use α = 0.8, whereas
                                               Vital only, micro                        Vital only, macro
Model                 Features                 P            R            F1           P            R            F1

Baseline              F_b                      53.9 / 26.6  26.5 / 63.4  35.5 / 37.5  47.5 / 36.6  23.8 / 94.0  31.7 / 52.7
Baseline, Multi-task  F_b                      60.7 / 60.7  41.4 / 41.4  49.2 / 49.2  36.7 / 36.6  40.5 / 94.0  38.5 / 52.7
Embedding, Single     F_b + F_e^c              54.7 / 54.7  52.1 / 51.8  53.4 / 53.2  44.9 / 38.3  37.6 / 85.5  40.9 / 52.9
Embedding, POS        F_b + F_e^p              53.9 / 49.9  46.4 / 53.1  49.8 / 51.4  44.0 / 36.6  32.9 / 94.0  37.6 / 52.7
Staleness only        F_b + F_e^p + F_t        57.3 / 57.3  48.3 / 48.3  52.4 / 52.4  47.5 / 39.1  33.8 / 85.8  39.5 / 53.7
Clustering only       F_b + F_e^p + F_c        57.0 / 57.0  49.0 / 48.9  52.7 / 52.6  46.4 / 38.7  34.2 / 85.0  39.4 / 53.2
Combined              F_b + F_e^p + F_c + F_t  56.2 / 56.2  48.1 / 48.1  51.8 / 51.8  46.1 / 36.6  32.6 / 94.0  38.2 / 52.7

Table 2: Vital Filtering performance using the submitted and revised runs for TREC KBA 2014. Each cell lists submitted / revised values.


Figure 6: Examples of P-R-F1 over confidence cutoffs. (a) Example Submission Curve; (b) Example Revised Curve.


the models that include the F_t features use γ_dec = 1 and
γ_inc = 0.1.

According to the official results, our submissions achieved
the 2nd best precision in the competition, but performed
poorly in overall macro F1 (8th position). Revisiting
our submission files, we found that we had misinterpreted the
concept of confidence. Figure 6a shows that we only make
vital predictions with confidence greater than or equal to
500, i.e. the right part of the curve is constant. We
should have also predicted vital with low confidence, i.e.
flipped our high-confidence non-vital predictions to be vital with
low confidence. That minor change boosts our recall (in
most cases), while the precision suffers slightly, as shown
in Figure 6b, leaving our system in the 2nd overall position.
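
One way to express the flip described above is sketched below. The exact mapping used for the revised runs is not given in the text, so the `MAX_CONF - c` rule, the constant `MAX_CONF = 1000`, and the function name `revise` are all assumptions made for illustration.

```python
MAX_CONF = 1000  # assumed upper end of the confidence scale


def revise(label, confidence):
    """Map every prediction to a vital rating with an adjusted confidence.

    Vital predictions are unchanged; a non-vital prediction made with
    confidence c becomes a vital prediction with confidence MAX_CONF - c.
    """
    if label == "vital":
        return "vital", confidence
    return "vital", MAX_CONF - confidence
```

Under this rule a confident non-vital prediction ends up near the bottom of the vital ranking, so it only affects the scores at low confidence cutoffs.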

The baseline provided by the TREC KBA organizers (not to
be confused with our Baseline model) assigns a vital rating
to every document that matches a surface form name of an
entity, assigning a confidence score based on the number
of matches of tokens in the name. The values reported by
the organizers are: macro-P=0.316, macro-R=0.520,
macro-F1=0.393, SU=0.3334 (Frank et al., 2014).
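
As a quick consistency check, the reported macro-F1 matches the harmonic mean of the reported macro precision and macro recall. The helper name `f1` is ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Organizers' baseline: macro-P = 0.316, macro-R = 0.520.
print(round(f1(0.316, 0.520), 3))  # 0.393
```

The same relation holds for the entries of Table 2, for example `f1(47.5, 23.8)` rounds to 31.7 for the submitted macro scores of Baseline.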

Baseline performs as expected, i.e. has lower F1 than the
other models. On the other hand, Baseline, Multi-task
performs far better than Baseline, which suggests the contexts
vary sufficiently across entities to benefit from separate
parameterization. Our proposed models further improve
upon Baseline, Multi-task. Further, using a single
embedding (Embedding, Single) outperforms multiple embedding
representations (Embedding, POS), suggesting that word
embeddings implicitly capture the various parts of speech
in their representation. The proposed staleness and
non-parametric clustering methods (Staleness only, Clustering only,
Combined) perform slightly worse than the simple Embedding,
Single method on the submission results. However, in the
revised versions, Staleness only scores the best macro F1 (53.7),
followed closely by Clustering only (53.2), which further
demonstrates the importance of the F_t and F_c features.

Figure 7: Macro P-R-F1-SU over confidence cutoffs

Figures 7 and 8 complement the revised results in Table 2
for different confidence cutoffs. Figures 7a and 7b show
that the macro recall increases substantially in the
lower-confidence half of the plot, while retaining most of the
precision; this results in a boost to F1 scores in the revised
macro scenario. For micro-averaging, on the other hand, the
precisions and recalls (shown in Figures 8a and 8b) follow
opposite trends, causing the micro revised F1s to be almost
the same as the micro submission ones.

Figure 8: Micro P-R-F1-SU over confidence cutoffs
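
The curves over confidence cutoffs can be reproduced in spirit with the sketch below, which scores the vital predictions whose confidence is at least a given cutoff. It is a simplified stand-in for the KBA scorer, not its actual code; the function name and the input format are assumptions.

```python
def prf_at_cutoff(predictions, truth, cutoff):
    """Precision, recall, F1 for vital predictions with confidence >= cutoff.

    predictions: dict doc_id -> confidence of a vital rating.
    truth: set of doc_ids that are truly vital.
    """
    selected = {d for d, conf in predictions.items() if conf >= cutoff}
    tp = len(selected & truth)
    p = tp / len(selected) if selected else 0.0
    r = tp / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy data: lowering the cutoff admits low-confidence vital ratings.
preds = {"d1": 900, "d2": 600, "d3": 100}
vital = {"d1", "d3"}
for cutoff in (500, 0):
    print(cutoff, prf_at_cutoff(preds, vital, cutoff))
```

In this toy example recall rises from 0.5 to 1.0 when the cutoff drops from 500 to 0, which is the effect the revised runs exploit.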

Figures 9 and 10 further illustrate the precision-recall trade-off for
the different methods. For macro averaging, all methods
perform almost the same. The micro metrics in Figure 10 are
much more interesting. At lower recall, the micro precisions
of the different models meet our expectations: the more
complex methods, including non-parametric clustering and
staleness, outperform our baselines. Nevertheless, at higher
recall, Embedding, Single takes the lead.

8.  Conclusion  &  Future  Work 


Figure  9:  Macro  Precision-Recall 


Figure  10:  Micro  Precision-Recall 


Filtering streaming documents in order to fill gaps plays
a crucial role in the maintenance and timely updating of
knowledge bases. With the exponential increase of
information on the web, it becomes critical to detect relevant
documents and incorporate their information in a timely
manner. In this paper we introduced a semi-supervised
learning model for document filtering tasks. We proposed a
word embedding-based non-parametric representation of
documents that groups entity references into topic clusters
and is suitable for streaming data. Further, we presented a
notion of staleness for entities and topics that dynamically
estimates the temporal relevance of the entity contexts. The
combination of these three core contributions (distributed
word embedding representations, non-parametric clustering,
and staleness) results in an accurate representation of
entity contexts, while simultaneously addressing the filtering
requirements of large corpora of streaming text documents.

A number of avenues exist for further work. A possible line
of future research would be exploring hierarchical clustering
algorithms to better represent topic clusters. It would also
be interesting to assess the effect of learning the
hyperparameters of the model instead of manually tuning them
for the specific datasets. Utilizing external resources such
as Wikipedia entity pages to construct more features
(Fang, 2012) will likely further improve the accuracy of the
method. It would also be worthwhile to assess the effects of
using different pre-trained word embeddings.